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Abstract —We propose a new algorithm for fast approximate 
nearest neighbor search based on the properties of ordered 
vectors. Data vectors are classified based on the index and sign 
of their largest components, thereby partitioning the space in 
a number of cones centered in the origin. The query is itself 
classified, and the search starts from the selected cone and 
proceeds to neighboring ones. Overall, the proposed algorithm 
corresponds to locality sensitive hashing in the space of directions, 
with hashing based on the order of components. Thanks to 
the statistical features emerging through ordering, it deals very 
well with the challenging case of unstructured data, and is a 
valnable building block for more complex techniques dealing with 
structnred data. Experiments on both simulated and real-world 
data prove the proposed algorithm to provide a state-of-the-art 
performance. 

Index Terms —Approximate nearest neighbor search, locality 
sensitive hashing, vector quantization, order statistics. 

I. Introduction 

A large number of applications in computer vision and 
image processing need to retrieve, in a collection of vectors, 
the nearest neighbor (NN) to a given query. A well-known 
example is image retrieval based on compact descriptors, but 
there are countless more, from patch-based image denoising, 
to copy-move forgery detection, data compression, and so 
on. Typically, large sets of points are involved, calling for 
fast search techniques to guarantee an acceptable processing 
time. However, for high-dimensional data, no exact NN search 
algorithm can provide a signihcant speed-up w.r.t. linear 
search, so one is forced to settle for some approximate search 
algorithms, trading off accuracy for efficiency. 

In recent years, there has been intense research on tech¬ 
niques that improve this trade-off. These can be classihed 
in two large families according to their focus on memory 
or time efficiency. A first family m, 0, a, a addresses 
the case of very large datasets, that do not fit in memory. 
This may occur, for example, in image retrieval and other 
computer vision applications. In this condition, performing an 
accurate search based on Euclidean vector distances would 
require data transfers from disk, with an exceedingly large 
delay. By associating compact codes with original vectors, and 
approximating exact distances based on such codes, one can 
drastically reduce memory usage and avoid buffering, with a 
huge impact on search speed. In this case, therefore, memory 
efficiency is the main issue and the prevailing measure of 
performance, while actual search time is of minor interest. 

On the contrary, when data and associated structures can 
ht in memory, processing time becomes the main issue, and 
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performance is measured in terms of accuracy vs speed¬ 
up, with memory usage assuming minor importance. This is 
the case of a large number of image processing application, 
especially in the patch-based paradigm. Notable examples are 
nonlocal denoising 0, exemplar-based inpainting 0, copy- 
move forgery detection Q and optical flow estimation 0. 
The present work hts in the latter family. That is, our aim is 
to provide a large speed-up with respect to linear search while 
still guaranteeing a very high accuracy based on Euclidean 
distance computation on the original vectors. 

Most techniques of this family follow a similar path, with 
many variations; hrst the search space is partitioned into 
suitable cell^ and all vectors are classihed based on the cell 
they belong to. Then, at run time, the query is itself classihed, 
and the NN is searched only among vectors falling in the same 
cell as the query, or possibly in some neighboring ones. Stated 
in these terms, approximate NN (ANN) search bears striking 
similarities with the vector quantization (VQ) problem 0, 
m. In both helds, dehning a good partition of the space and 
a fast classihcation rule are the key ingredients of success. 
Hence similar concepts and tools are used im. 

In the quantization literature there has been intense research 
on optimal (minimum distortion) space partitioning. The k- 
means algorithm provides a locally optimal solution 112, with 
cells well packed in the space and adapted to data point 
density. However, to classify a query, M vector distances 
must be computed, with M the partition size, making this 
approach unsuited to fast NN search. Therefore, a number of 
constrained solutions have been adopted in practice (see Eig|^. 
In hierarchical k-means, a tree-structured search is carried 
out, using partitions of size M' ^ M at each node, with 
a sharp reduction of complexity. The limiting case M' = 2 
corresponds to binary kd-trees search ifTSll . With binary de¬ 
cisions at the nodes, the dataset is recursively partitioned 
by fc-dimensional hyperplanes, allowing a very fast search. 
However, complex tree-visiting schemes are needed to obtain 
an acceptable reliability. In ifffil . ITSl a good performance is 
obtained by using multiple randomized kd-trees. 

Another popular approach is product quantization (PQ) 0, 
with its many state-of-the-art variants, inverted multi index 0, 
Cartesian k-means m, optimized PQ a, locally optimized 
pQini, composite quantization M- In PQ the vector space is 
decomposed into the Cartesian product of subspaces of lower 
dimensionality, and quantization is carried out independently 
in each subspace. Data points are then represented by compact 

'Often the cells overlap, so it is not a partition in strict sense: this is an 
example of the many variations which we will overlook, from now on, for 
the sake of readability. 
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Fig. 1. ANN search is tightly related to quantization and space partitioning. 
Some important concepts may emerge even with a simple 2d example. For 
structured data, unconstrained vector quantization (a) is optimal but usually 
too complex for real-world applications. Tree-structured VQ (b), and product 
VQ (c) reduce complexity but do not fully exploit data dependencies. For 
unstructured data, it can make sense resorting to product scalar quantization 
(d). However, lattice VQ (e) provides a better tessellation of the space. 
Basic LSH coiTesponds to non-oi1hogonaI product quantization (f), hence it 
is theoretically worse than both product and Lattice VQ. 


codes obtained by stacking the quantization indices of all 
subspaces. Thanks to these compact codes and to approxi¬ 
mate code-based distances, large datasets can be dealt with 
efficiently. The limiting case of scalar subspaces corresponds 
to the independent quantization of vector components. This 
solution is simple but largely suboptimal, and much better 
regular partitions of the space can be found CD. 

Interestingly, one of the most popular ANN search tech¬ 
niques, locality sensitive hashing (LSH) EOll . ETl . amounts 
to nothing more than scalar product quantization im, with 
data projected on randomly oriented, non-oithogonal, axes. 
Nonetheless, LSH can achieve a good performance through 
a number of clever expedients, like the use of multiple hash 
tables Ea, and multi-probe search ll22ll . Moreover, data- 
dependent variants of LSH improve largely over basic LSH, 
by learning projection axes from the data ES, ED, ES, and 
using non-uniform quantizers ED. 

In general, taking advantage of the intrinsic structure of data 
can greatly help speeding up the search. However, often data 
have little or no structure, or become unstructured after some 
preliminary processing. In hierarchical k-means, for example, 
as well in IVFADC El. after the top-level quantization, the 
points become almost uniformly distributed in their cells. 
Looking for the NN when data have no structure is an 
especially challenging task, encountered in many high-level 
problems, and therefore a fundamental issue of ANN search. 

In this work we propose a Reliable Order-Statistics based 
Approximate NN search Algorithm (ROSANNA) suited for 
unstructured data. Like in the methods described before, 
we dehne a suitable partition of the space, and carry out 
classification and priority search. The main innovation consists 
in classifying the data based just on the index and sign of their 
largest components. This simple pre-processing allows us to 
partition the search space effectively, with negligible complex¬ 


2 


ity and limited impact on memory usage. For each query, only 
a short list of candidates is then selected for linear search. 
ROSANNA can be also regarded as a variant of LSH, where 
the hashing is based on the vector directions. This makes 
full sense in high dimensions, since data tend to distribute 
on a spherical shell. ROSANNA produces a uniform partition 
of the space of directions, and all vectors are automatically 
classified based on the order of their sorted components. By 
using multiple hash tables, obtained through random rotations 
of the basis, and a suitable visiting scheme of the cells, a 
very high accuracy is obtained, with signihcant speed-up w.r.t. 
reference techniques. Experiments on both unstructured and 
real-world structured data prove ROSANNA to provide a state- 
of-the-art search performance. We also used ROSANNA to 
initialize the NN held in copy-move forgery detection, a real- 
world computation-intensive image processing task, obtaining 
a signihcant speed-up. 

In the vast literature on ANN search, several papers re¬ 
lated with ROSANNA have obviously appeared. For example, 
sorting has been already exploited by Chavez et al. ESj . 
This technique, however, is pivot-based rather than partition- 
based. A number of anchor vectors, or “pivots” are chosen in 
advance. Then, for each database vector the distances to all 
pivots are computed and sorted, keeping track of the order 
in a permutation vector. At search time, a permutation vector 
is computed also for the query and compared with those of 
the database points to select a short list of candidate NNs, 
assuming that close vectors have similar permutation vectors. 
In summary, sorting is used for very different goals than in 
the proposed method. 

If ROSANNA is regarded as LSH in the space of directions, 
then the same goal is pursued by the Spherical LSH (SLSH) 
proposed in E3, (not to be confused with the unrelated 
Spherical Hashing ED), where a regular Voronoi partition 
of the unit hypersphere is built based on the vertices of an 
inscribed polytope. For example, working in a 3d space, the 
unit sphere can be partitioned in 8 cells corresponding to 
the 8 vertices of the inscribed cube. Therefore, in SLSH, a 
regular partition is obtained with a low-complexity hashing 
rule ED- Unfortunately, in high-dimensional spaces {K > 5) 
there are only three kinds of regular polytopes, simplex, with 
K+l vertices, orthoplex, with 2K vertices, and hypercube 
with 2^ vertices. SLSH uses eventually only the orthoplex 
polytope, which corresponds to a strongly constrained version 
of ROSANNA. Likewise, Iterative quantization (ITQ), pro¬ 
posed in 0 for ANN search through compact codes, makes 
reference to hyper-octants, and hence corresponds to another 
(opposite) constrained version of ROSANNA. By removing 
such constraints, ROSANNA is able to provide much better 
results. 

In Concomitant LSH 12^ . instead, a Voronoi partition of the 
space of directions is built based on a set of M points taken at 
random on the unit sphere. Although originally proposed for 
cosine distance, it presents some similarities with ROSANNA, 
the use of directions and sorting, and can be easily adapted 
to deal with the Euclidean distance. However, to classify the 
query, M distances must be computed at search-time, before 
inspecting the candidates, which is a severe overhead for 








































IEEE TRANSACTIONS ON IMAGE PROCESSING 


3 


p 

c 

Xi 

X2 

^3 


0 

-22 

12 

5 


-21 

-19 

-12 

1 


29 

24 

-13 

1 

44 

17 

-4 


49 

-6 

5 



57 

8 

-2 


0 

-3 

-18 

10 


-1 

-13 

0 

2 


5 

11 

4 


1 

11 

14 

-3 



14 

25 

23 



-36 

23 

-47 


0 

5 

26 

-27 

3 

9 

-2 

-17 



12 

5 

-14 


1 

-7 

11 

22 


P 

c 

Xi 

X2 

0=3 


0 

-21 

-19 

-12 


-1 

-13 

0 


1 

-22 

12 

5 

1-2 

2 

49 

-6 

5 


5 

11 

4 



11 

14 

-3 


3 

29 

24 

-13 



44 

17 

-4 



57 

8 

-2 


0 

-36 

23 

-47 

1-3 

2 

9 

-2 

-17 



12 

5 

-14 


1 

-3 

-18 

10 

2-3 

2 

5 

26 

-27 


14 

25 

23 



-7 

11 

22 


Fig. 2. Two different organization of the same dataset. Left: only the largest 
component is used for classification, G=l. Profile p={l} includes all vectors 
where the first component, xi, is largest. The profile is divided in two cones, 
c=0, including vectors with xi < 0, and c=l, including the others. Right: 
the two largest components are used for classification, G=2. Profile p={l- 
2} includes all vectors where xi and X 2 are the largest components. The 
profile is divided in four cones, according to the sign of these components. 
The largest G components are shown in bold and color. 


large hashing tables. To reduce this burden, several variants 
are also proposed in ll29l which, however, tend to produce a 
worse partition of the space. Irrespective of the implementation 
details, the regular partition of the space of directions provided 
by ROSANNA can be expected to be more effective than 
the random partition used in Concomitant LSH. Moreover, in 
ROSANNA, the hashing requires only a vector sorting, with 
no distance computation. As for concomitants (related to order 
statistics) they are only used to carry out a theoretical analysis 
of performance, but are not considered in the algorithm. 

Recently, Cherian et al. proposed to use sparse ANN codes 
1301, where each data point is represented as a sparse combi¬ 
nation of unit-norm vectors drawn from a suitable dictionary. 
The indexes of the selected dictionary vectors represent a short 
code used to speed up retrieval. If a low-coherence dictionary 
is designed ED, close data points tend to fall in the same 
bucket, a cone identified by the selected dictionary vectors, 
which is then searched linearly. This technique (SpANN) is 
explicitly designed to deal with datasets with large nominal 
and low intrinsic dimensionality (i.e., sparse). Therefore, it 
is ineffective with unstructured data. In this case, the extra 
efforts of designing a low-coherence dictionary (off-line), and 
hnding the sparse code of the query (at search time) are 
basically useless. ROSANNA can be seen as the limiting 
case of SpANN when the vocabulary vectors are uniformly 
distributed over the space of directions and a single vector is 
used to approximate a data point. 

In the following, we first describe the basic algorithm and 
explain its rationale (Section 2), then describe the full-fledged 
implementation (Section 3), discuss experiments on simulated 
and real-world data (Section 4) and hnally draw conclusions. 





Fig. 3. Atomic orbitals Px-,Py and pz (top row) resemble (loosely) the three 
profiles aiising in a 3d space with G=l, including two cones each. Likewise 
orbitals dxy, dxz and dxz (bottom row) resemble the three four-cone profiles 
of the case G=2. The same color coding as figure 2 is used. 


II. ANN SEARCH BASED ON ORDER STATISTICS 

Given a set of N vectors x„ G n = drawn from 

a common source, we look for the nearest neighbor to query 
y drawn from the same source, according to the Euclidearpl 
distance, 

||xnn - y|p < lix„ - yf, n= (1) 

We organize in advance the dataset in disjoint sets, called 
prohles, based on the index of the G vector components that 
are largest in absolute valu^ For example, taking G=l, we 
define K disjoint profiles, with profile j including only the 
vectors for which the j-th component is the largest 

Pj ~ {Xn • laiyjjj ^ ^ = 1, ■..; at} (2) 

Within each profile, we further divide the vectors in subsets, 
called cones, according to the sign of the largest compo¬ 
nents, for example only two cones if G=l. Fig|^ shows two 
alternative organizations of a toy dataset, composed by 16 
3d vectors, in the cases G=1 and G=2. Taking advantage of 
some well-known pictures of atomic orbitals, Fig|^ provides 
an approximate representation of profiles and cones in the 3d 
space, again for G=1 and G=2. 

At run-time, the query y is itself classified, based on index 
and sign of its G largest components, searching for the nearest 
neighbor only in the corresponding profile and cone. In our 
classification, we tell apart the largest G components from 
the K-G smallest ones, but do not sort components within 
these groups. Therefore, there are (g) distinct profiles and, 
Nc = cones, counting 2*^ cones for each profile. In 

conditions of perfect symmetry for the source, Nc represents 
also the average speed-up, measured as the ratio between 
dataset size N and number of vectors searched. For the 
example dataset of Fig|^ Nc equals 6 when G=l, and 12 
when G=2. 

This brief description elicits some natural questions: is there 
potential for significant speed-up with this approach? Is the 
NN really likely to belong to the same cone as the query? 

^The angular distance is also appropriate for ROSANNA, but we focus on 
the Euclidean distance for its higher relevance in real-world problems. 

^In the following, we omit “in absolute value” when obvious. 
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Fig. 4. Scatter plots of the first two components of 512 8d Gaussian vectors. 
Left: randomly selected vectors. Right: vectors where the first two components 
are the largest ones. 


Both questions may have a positive answer when we move to 
high-dimensional spaces, if ^ 1, thanks to the properties of 
order statistics. 

Concerning speed-up, it is important to note that the number 
of cones grows very quickly with K and G. For example, 
it is almost 30000 for K=16 and G=4. Note that K=16 is 
considered a relatively small dimensionality for NN search 
problems. By increasing the number of cones, one can reduce 
at will the average number of vectors per cone, hence the 
search time. Needless to say, in doing so, one should always 
guarantee a high probability that the NN is actually found in 
the searched cone (or cones). 

To analyze this point, let us focus on the specific case 
of vector components modeled as independent identically 
distributed (i.i.d.) random variables with standard Gaussian 
distribution, Xi ~ JV{0, 1). This is arguably a worst case for 
the NN search problem, as there is no structure in the data to 
take advantage of. It can be appreciated, for example, in the 
left scatter plot of Fig|^ showing the first two components 
of 512 8-dimensional vectors with i.i.d. standard Gaussian 
components. In the same plot we also show, in red, the first 
two components of a query drawn from the same distribution. 
Clearly, there is not much structure in the data to help speeding 
up the NN search. The right scatter plot of Fig|^ is obtained 
as the left one, except that we now include only vectors such 
that the first two components are also the largest ones. The 
difference between the two scatter plots is striking; sorting the 
components has created a structure in the data, which can be 
exploited to speed-up the search. In particular, we can safely 
restrict attention to only one of the four emerging clusters (our 
cones), comprising vectors where the sign of the two largest 
components is the same as in the query, gaining therefore a 
factor four in speed. Still, it is not impossible that the true 
NN belongs to a different cluster (it depends on the remaining 
components) but there is an energy gap to overcome, due to 
the query-cluster distance in the first two components. 

Of course, after sorting, the components are not identically 
distributed anymore, and certainly not independent on one 
another. Given the probability density function (pdf) of the 
original components, fx{x), we can easily compute the pdf of 
the sorted components. Let Ai = \Xi\ be the absolute value of 
the i-th component, and the z-th component of the sorted 
vector of absolute values, such that Note that 




Fig. 5. pfd of the components of an 8-dimensionaI vector of i.i.d. standard 
Gaussians after sorting for decreasing magnitude. The largest components 
(top) have a high variance and are clearly bi-modal. The first component 
alone holds 44% of the vector energy, the first four almost 90% of it. 


A(i) is smaller than a given value x only if all the A[s are. 
Likewise, ^(i) is smaller than x only if at least K — i + 1 
of the A[s are. Based on such observations, and due to the 
independence of the A[s, we can compute the marginal pdf 
of the sorted absolute values (see Appendix A) 

K\ 

X [FA{x)f-^[l - FA{x)r^fA{x) (3) 

where F( ) denotes cumulative distribution function (CDF). 
Then, given 0, we readily obtain the pdf of the original 
components after sorting them by decreasing magnitude. 

In our example, the components are standard Gaussian, 
with CDF expressed in terms of the Q-function ll^ as 
Fxiix) = 1 —(5(a;). For the case Ar=8, Fig|^shows the pdf of 
all components, which are very different from one another. The 
first components (top) have a much larger variance than the 
last ones, holding most of the vector energy. Therefore, they 
impact heavily in the computation of the Euclidean distance 
w.r.t. a given query, while the last ones are almost negligible. 
Moreover, the largest components are markedly bimodal, with 
modes growing farther apart as K grows. 

This figure provides, therefore, some more insight into the 
rationale of our approach. We are trying to classify vectors 
beforehand, in a sensible way, to reduce the search space. 
Doing this by taking into account all components with equal 
importance would be impractical (or infeasible) as K grows 
large, and not much reliable, because most components are 
scarcely informative. Therefore, we focus only on the largest 
components, those holding most of the energy and of the 
information content, obtaining a much smaller (and tunable) 
number of classes and, eventually, some stronger guarantee 
that the NN will indeed belong to the same profile as the query. 
As a matter of fact, we chose the name “profile” for analogy 
with the actions naturally taken to identify a person based 
on a summary description, focusing on the most prominent 
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Fig. 6. Scatter plots of 2-dimensional i.i.d. Gaussian vectors. At very high 
density (left, p=4) the NN belong almost always to the query’s cone. This 
may not happen at lower density (right, p=2), especially if the query is near 
to the cone boundaries. A rotation of coordinates brings the query near the 
center of the new cone (dash-dot lines), with the NN in the same cone. 


Algorithm 1 ROSANNA (NN search) 

Require: y 

> query 

Ensure: NN > index of approximate nearest neighbor 

for r = 1 : do 

[> for each rotation 

compute 

[> projection of y on r-th basis 

find {c^i\ ..., 

D> ordered list of cones to be visited 

end for 


for I = 1 : C do 

> C cones visited for each basis 

for r = 1 : R do 


for each x S c[’'^ do 


if X not analyzed then 

> boolean side information 

compute II X — y ||^ 

> with partial distance elimination 

update NN 


mark x as analyzed 


end if 


end for 


end for 


end for 



features, “...he had the most unusual nose...”, “...she had a 
curious accent...”, to reduce the search space while preserving 
accuracy. Given the profile, and assuming the NN is actually 
found in that class, the analysis of signs restricts the search 
very reliably on the cone of interest. In fact, since the largest 
components have such a strongly bimodal distribution, it is 
very unlikely that the smallest components cause a cone 
switch. 

Taking a different point of view, ROSANNA can be seen 
as a form of locality sensitive hashing. Component sorting 
becomes just a means to determine algorithmically a partition 
of the space based on vector direction. Given the identity of the 
G out of K largest components, a data vector is automatically 
associated with one of the cells of the partition, and the same 
happens with the query. With G=l, the space in divided in 
2K cells, our cones, which become 2K{K — 1) with G=2, 
and so on, up to 2^ cells for G=K. It is worth underlining 
that the space partition is, by definition, completely symmetric, 
and induces a partition of the unit hyper-sphere with the 
same property. In Appendix B we characterize the proposed 
approach in terms of collision probability. 

III. Implementation 

We now turn the naive basic algorithm into a reliable 
ANN search tool. The weak point in the basic version is the 
assumption that the NN is found exactly in the cone singled 
out by the query. This is quite likely if the dataset has a high 
density of points, 

p = \og^N/K (4) 

as in the 2d example on the left of Fig|^ much less so in 
the case of lower density, shown on the right. This latter plot 
shows how the NN may happen not to be in the query’s cone, 
especially if the query lies near the boundary of the cone and 
not in the very center. This might look as a rare unfortunate 
case. However, in high-dimensional spaces, this is actually 
quite likely, especially at low density. As a matter of facts, 
even in the right plot the point density is actually quite high, 
while in most real-world applications densities in the order of 
p=\ or even lower are to be expected, in which case the NN 
may easily happen to be far from the query. Fig|^ however, 
suggests also possible countermeasures, amounting basically 


in considering alternative bases (see the dash-dot lines on the 
right), obtained through rotation, or including also neighboring 
cones in the search. 

To make our OS-based search reliable we resort therefore 
to some typical expedients of LSH methods, enlarging the set 
of candidate points and exploring them with suitable priority. 

A. Using multiple bases 

To increase the reliability of our search algorithm we deal 
first with the boundary problem. Although this is not obvious 
in the 2d case, in higher-dimensional spaces it is quite likely 
that the query lies far from the center of its cone. When this 
happens, the probability that the NN belongs to a different 
cone is quite large, exceeding 1/2 when the query lies exactly 
on a boundary. To address this problem, we consider multiple 
reference systems, obtained from one another through random 
rotations, like in llT4l . and look for the NN in the union of all 
the cones where the query belongs. This solution corresponds 
to the use of multiple hash tables in LSH algorithms, and 
presents the same pros and cons. The probability of finding the 
NN in the enlarged cone is much higher than before, but there 
is a processing cost, since the query is projected on multiple 
bases and more points are checked, and a memory overhead, 
due to the need to store multiple classifications. 

B. Checking neighboring cones 

Using multiple bases increases the probability of finding 
the NN in the query’s enlarged cone, but there is still a 
non-negligible probability of missing it, especially in the 
low-density case. Therefore, it can make sense to extend 
the search to some close cones, as far as a positive time- 
accuracy trade-off is kept, which is the multiprobe search 
used in LSH methods ||22|. Rather than computing the actual 
Euclidean distances between the query and candidate cones, 
we exploit the intrinsic structure of profiles defined for various 
values of G. Let ii,i 2 ,.. ■ Gk be the indexes of the query 
coordinates sorted by decreasing magnitude. Therefore, for a 
given value G, the query belongs to the profile identified by 
{ii,i 2 ,. ■., *g}- The most likely reason why the NN may not 
belong to the same profile is that its G-th coordinate differs 
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Fig. 7. i.i.d. Gaussian data, p=l. Sign agreement between query and NN (top- 
left) is quite high, especially with multiple rotations. Classification accuracy 
(other graphs), instead, is guaranteed only with a small number of query 
components, F, and many rotations. 

from the query’s, so we should begin by looking in all profiles 
sharing the first G —1 indexes with the query’s profile but 
differing in the last one. Note that all such level-G profiles 
are “children” of the same level-(G—1) profile, identified 
by {*1, *2, ■ ■ ■, *G-i}- Therefore, we define a distance between 
two level-G profiles, pi and p 2 , as G minus the level of their 
closest common ancestor. Based on such a distance, for each 
rotation an ordered list of cones to be visited is established. 

A compact simplified pseudo-code of ROSANNA (only the 
search phase) is shown in Algorithm 

C. Preliminary assessment of reliability 

We ran a few experiments to gain insight into the importance 
of using multiple bases and searching neighboring cones, and 
to assess the reliability of the proposed search algorithm. 

We consider a running example with i.i.d. Gaussian com¬ 
ponents, with K=\6 and log2iV=16, hence density p=l, and 
report in Fig|7] (top-left) the probability (MonteCarlo esti¬ 
mates) that the first G components of query and NN have the 
same sign. Using a single basis, this probability is very large 
only for small values of G: for example, while it is almost 
certain that the first 3 components have the same sign, the 
probability drops to about 60% for G=8. Note that for G=16, 
the probability is almost 0, showing that using signs for clas¬ 
sification without prior sorting, like in ITQ, is prone to errors. 
Using multiple bases guarantees a significant improvement, 
and already with 8 rotations the first 8 components of query 
and NN have almost certainly the same sign. 

The top-right plot of Fig|7] reports the probability that 
the largest component of the query (F=l) is among the G 
largest components of the NN. The probability is just above 
40% for G=l, that is, even in this relatively high-density 



Fig. 8. i.i.d. Gaussian data, 16 rotations. Sign agreement between query 
and NN (top-left) is high at all tested densities. Classification accuracy (other 
graphs) is limited at low density if more than two query components are used. 


case, we cannot fully trust the largest component for reliable 
classification. If we look also in profiles where the largest 
component of the query is only the second, third, ..., G-th 
largest, the probability of finding the NN grows, but not as 
quickly as we might hope. However, by using multiple bases, 
the whole curve drifts rapidly towards 1, becoming almost 
flat for 8 rotations. To increase search efficiency, however, 
classification should be carried out based on more components, 
F > 1. The bottom plots report, therefore, the same data with 
reference to the two and three largest components of the query, 
respectively. In both cases, the classification is very reliable 
for small values of G if 8-16 rotations are used. 

In Figj^ we explore the dependence on dataset density, 
going from 2 (high) to 1/2 (quite low), with 16 rotations. The 
density is modified by keeping log 2 ~ 16 and varying the 
vector length K. In any case, results are shown for G < 16, for 
clarity. The top-left plot shows that sign agreement between 
query and NN is always guaranteed, at least for the first 8 
components. The other plots report the probability that the 
largest 1, 2, 3 components, respectively, of the query are 
among the G largest of the NN. This probability is very 
high for F=l at any density, and for any F at high density. 
At low density, instead, classification becomes unreliable for 
F > 2 unless a large value of G is used, which is impractical, 
however, as it would require visiting a huge number of profiles. 
Therefore, we cannot expect the search to be much reliable 
in this case. On the other hand, the same holds (to the best 
of our knowledge) for all other ANN search methods at 
such low densities. Indeed, most large-dimensionality sources 
considered in the literature present strong dependencies which 
reduce significantly the intrinsic data dimensionality lEl, El, 
a, allowing for an effective ANN search. 
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accuracy 



0.99 


Fig. 9. Experimental results at p=l with Gaussian (left), Uniform (center), and Laplace (right) i.i.d. data. In all cases ROSANNA, outperforms all references, 
sometimes by an order of magnitude, and results are only weakly affected by the data pdf. 


N number of dataset vectors 

K vector length 

G number of components used for query 

Nc number of cones, Nc = (q)2® 

R number of rotations (hash tables) 

C number of cones visited for each rotation 


TABLE I 

Main parameters of the algorithm. 


D. Assessment of complexity 

We can now provide a theoretical assessment of compu¬ 
tational complexity, keeping in mind, however, that some 
processing steps include random components that may af¬ 
fect results significantly. To this end, Tab|I] lists the main 
parameters of the algorithm and the associated symbols, while 
Tab|n|reports the complexity assessment as a function of these 
quantities. 

The dataset preparation phase is normally of minor interest 
since it is carried out off-line once and for all. Considering that 
N is much larger than K, the dominant term of this phase 
is the rotation of the dataset points along the R bases. We 
are considering the use of structured orthonormal matrices, 
like DCT or Walsh-Hadamard, in place of pseudo-random 
matrices, which may reduce this cost. Hashing, instead, has 
more limited cost, related to vector sorting. 

For on-line NN search the most critical item is typically 
the linear search of the candidate short-list where the distance 
from query to all candidates must be computed. However, 
the corresponding entry in Tab[n| is only an approximation, 
based on the assumptions that dataset points are uniformly 
distributed among the cones, and that the visited cones in¬ 
clude disjoint sets of points. The first assumption is pretty 
reasonable, the second much less. When multiple rotations 
are used, it is very likely that some points are visited more 
than once, in which case the distance is not computed anew. 
Therefore, our estimate is a bit pessimistic, but how much 
so depends on many parameters. We can however single out 
best and worst cases. In the best case, the shortlist includes 
only one candidate, and search complexity is dominated by the 
cost of query rotation 0{RK^). In the worst case, the shortlist 


phase 

action 

complexity 

dataset preparation 

generate rotation matrices 
rotate dataset points 
hash dataset points 

O(RK^) 

O(NRK^) 

0{NRK\og(K)) 

NN search 

rotate query 
hash query 

search short-list 

O(RK^) 

0{RK \og{K)) 
0{RCK{N/Nc)) 


TABLE II 

Complexity assessment. 


includes all dataset points, coming down to linear search, with 
cost NK. 

In next Section, ROSANNA will be applied also to long 
vectors (e.g., 128 components) reduced to shorter unstructured 
vectors through PCA and random rotation. Taking into account 
also the estimation of covariance matrix and the PCA, and 
the use of partial distance elimination techniques, the above 
analysis holds with minor adjustments also in such a case. 

IV. Experimental analysis 

We carried out a number of experiments to assess the per¬ 
formance of the proposed method. Results are given in terms 
of accuracy-efficiency plots, as in ca, m with accuracy 
meant as the probability that the selected point is the actual NN 
(also called precision and recall® 1), and efficiency measured 
in terms of speed-up w.r.t. linear search. The software is 
written in C-H- using open libraries and some routines of the 
FLANN packag^ and is published onlin^to guarantee full 
reproducibility of results. There are only a few parameters to 
set: the number of components used for classification, G, the 
number of rotations, R, used for multiple-basis search, and the 
number of visited cones per basis, C. In the preparation phase, 
for each rotation, all dataset points are projected on the new 
basis and classified according to the index and sign of their G 
largest components. Therefore we need R hash tables, with a 
relative memory overhead of R/K. A second-level hash table 

^ http://www.cs.ubc.ca/research/flann/ 

^ http://www.grip.unina.it 
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Fig. 10. Experimental results with Gaussian i.i.d. data at high (p=1.5, left), medium (p=l, center), and low (p=0.75, right) density. Performance depends 
strongly on density. ROSANNA works much better than the references at high density, and is on par with HKM at low density. In the latter case, a very 
limited speed-up is obtained anyway. 


is used to manage the tables when the number of cones grows 
very large. 

We compare results with a number of relevant state-of-the- 
art references: i) plain Euclidean LSH (E2LSH) ll?5l with 
the implementation available onlin^ including the automatic 
setting of most parameters; ii) randomized kd-trees (RKDT) 
and, in) hierarchical k-means (HKM), both implemented in the 
ELANN package ifTSll . together with iv) ELANN itself, which 
is always inferior to the the best of RKDT and HKM but 
sets automatically all parameters; v) the IVFADC algorithm 
m, based on product quantization, and implemented by us 
starting from the Matlab code published by the author^ and 
hnally vi) Inverted Multi-Index (IMI) m developed by the 
author^ except for the final linear search among candidates, 
which we carried out as in ROSANNA. All these techniques 
are implemented in C-n- language. We also implemented 
and run optimized PQ, both parametric and non-parametric 
a, but did not include results, generally worse than those 
of the PQ-based IMI and IVFADC, in order not to clutter 
further the hgures. Curves are obtained (except for ELANN) 
as the upper envelope, in the accuracy-time plane, of points 
corresponding to different parameter settings. For ROSANNA, 
we consider G G {1,2,..., K/2}, R G {1, 2,4, 8,16}, C G 
{1, 2,4,..., 128}, and similar wide grids are explored for 
the main parameters of all other techniques. To focus on the 
more interesting high-accuracy range, in all graphs we use a 
logarithmic scale for both accuracy and speed-up. 

A. Unstructured data 

Fig|^(left) shows results for our running example, Gaussian 
i.i.d. data, fc=16, p=\. ROSANNA guarantees uniformly the 
best performance, being almost twice as fast than the second 
best, HKM, at all levels of accuracy, and much faster than 
all the other references, gaining a full order of magnitude 
w.r.t. RKDT and E2LSH. The ELANN curve lies somewhat 
below HKM, since the parameters are selected in advance and 

^ http://www.mit.edu/~andoni/LSH/ 

'http://people.rennes.inria.fr/Herve.Jegou/projects/ann.html 

* http://arbabenko.github.io/Multilndex/ 


may turn out not to be the best possible. However, we could 
not compute results for ELANN at high accuracy due to the 
large time needed to optimize the parameters. In Fig|9] we 
also shows results obtained in the same conditions as before 
but using Uniform (left) and Laplace (right) random variables 
in place of Gaussian. The general behavior is the same as 
before, with slight improvements observed in the Uniform 
case, probably due to the smaller entropy. 

With Fig[^ we go back to the Gaussian case, but change 
the dataset density considering p =1.5 (left), p =1.0 (center) 
as before, for ease of comparison, and p =0.75 (right). As 
expected, ROSANNA works especially well at high density, 
while its performance becomes very close to the best reference, 
HKM, at lower density. In this latter case, however, a signif¬ 
icant speed-up can be obtained only at pretty low accuracy, 
whatever the technique used. 

B. Structured data 

Previous experiments confirm that ROSANNA works very 
well with unstructured data. It can be argued, however, that 
most real-world datasets are highly structured, and often 
have large dimensionality. Therefore, we now consider some 
popular structured sources, SIFT descriptors ||36l, and MNIST 
images of handwritten digits, often used to test the perfor¬ 
mance of ANN search algorithms. In particular we will use 
the lOOK-vector UBC SIFT datasefj the IM-vector IRISA 
SIFT datasej^ and the 60K-vector MNIST databas^^ with 
the train/test split coming with each one. SIFT vectors have 
length 128, while MNIST images comprise 784 pixels. In both 
cases, we search the datasets based on reduced-dimensionality 
vectors. First, we compute the PCA and project the points 
on the new basis, and then classify the data based only on 
the hrst 16 components, which account for a large fraction 
of the energy (about 70% for the SIFT datasets and 60% 
for the MNIST database). In any case, the NN is searched 
among the selected candidates by computing distances over 

'http://people.cs.ubc.ca/~mariusm/uploads/FLANN/datasets/siftl00K.h5 

http://corpus-texmex.irisa.fr/ 

*' http://yann.lecun.com/exdb/mnist/ 
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Fig. 11. Experimental results with real-world data. Left: lOOK UBC SIFT; center: 60K MNIST; right: IM IRISA SIFT. ROSANNA outperforms almost 
uniformly all reference techniques in the first two cases. The same happens in the third case after k-means clustering (ROSANNA-f). 




Fig. 12. Normalized eigenvalues (left) and their cumulative sum (right) for 
two SIFT datasets. Energy distribution is highly skewed, with more than 70% 
of the energy in the first 16 components. The intrinsic data dimensionality is 
much smaller than 128. 

all components, using partial distance elimination (PDE) llJTll 
to speed-up the process. 

The use of PCA is motivated by the need to reduce com¬ 
plexity and, a posteriori, by experimental evidence. However, 
it is also justified by the observation that the intrinsic dimen¬ 
sionality of real-world data is typically much smaller than their 
nominal dimensionality. SIFT descriptors, for example, have 
a nominal dimensionality of 128, but the components are also 
strongly correlated. In Fig|^we show, for both the UBC and 
the IRISA datasets, the first 48 normalized eigenvalues Ai (a), 
which account for the distribution of energy among the vector 
components after taking the PCA, and their cumulative sum 
(b). In both cases, and especially for IRISA, the distribution 
is very far from uniform, with most eigenvalues very close to 
zero, and 70% of the total energy in the first 16 components. 
Therefore, the intrinsic data dimensionality is much smaller 
than 128. A rough estimate is 

D* = (5) 

where H{p) is the informational entropy, and pi = 
Ai/^j(Ai). With this definition, the intrinsic dimensionality 
drops to 38.69 for UBC and 27.94 for IRISA. Of course, this 
estimate neglects non-linear dependencies, quite significant 
for SIFT data (and MNIST as well), so the true intrinsic 
dimensionality is arguably even smaller. 

Results are reported in FigjTT] In general, a much larger 
speed-up is obtained w.r.t. to the case of unstructured data 


(notice the decade shift on the y-axis) and no obvious loss 
of accuracy is observed due to the classification performed 
only on 16 components. On both the UBC SIFT and MNIST 
datasets, ROSANNA outperforms all reference techniques in 
the medium-accuracy and especially high-accuracy range. In 
the 0.9-0.99 accuracy range, it is about twice as fast as 
the best competitors. The situation changes with the IRISA 
SIFT dataset, where all techniques, including ROSANNA 
(and except F2LSH), provide a comparable performance. The 
reason lies in the much stronger structure of the IRISA 
data, where the first PCA component accounts for almost 
33% of the total energy, as opposed to just above 10% for 
UBC data (the dataset size is, instead, immaterial, as the 
same behavior is observed with lOOK vectors). This is not 
surprising, since ROSANNA is not designed to exploit data 
dependencies. In this case, the best performance is provided by 
IVFADC, which performs a preliminary k-means clustering, 
thus exploiting major data dependencies, before resorting to 
product quantization within a restricted number of clusters. We 
resorted therefore to a similar solution to adapt ROSANNA to 
the case of highly dependent data. Data are clustered off-line 
by k-means. At search time, the query is compared with the 
cluster centroids, and only the nearest clusters are analyzed 
with ROSANNA, collecting the candidates that are eventually 
searched for the NN. The overall search time is roughly halved 
w.r.t. the basic version, providing much better results than all 
references, including IVFADC, especially at high accuracies. 
Preliminary k-means clustering is instead ineffective with the 
less structured UBC SIFT and MNIST data. 

We conclude this analysis by showing, in Tab[nl| inspired 
to Tab.i of na, some numerical performance figures of 
ROSANNA for the lOOK UBC-SIFT dataset as a function 
of its main parameters. In the first line we consider a pivot 
configuration with speed-up 100, while the following six 
lines (labeled G—, G-f, R—, R-\-, G—, G-f) provide some 
insight into the effect of increasing/decreasing only one of the 
parameters at a time w.r.t. the pivot. By increasing G, narrower 
cones are generated, leading to faster search (remember that 
the other parameters are fixed) but lower accuracy, while the 
opposite happens when G decreases. Operating on R and G 
produces similar effects, slower search and higher accuracy 
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configuration 

parameters 

G R G 

accuracy 

search 

speed-up 

memoiy 

overhead 

build 

time 

pivot 

4 

8 

4 

0.905 

100 

0.36 

0.36 

G- 

3 

8 

4 

0.961 

37 

0.16 

0.35 

G-y 

5 

8 

4 

0.814 

168 

0.69 

0.36 

R- 

4 

4 

4 

0.788 

180 

0.18 

0.26 

R-V 

4 

16 

4 

0.966 

54 

0.71 

0.57 

G- 

4 

8 

2 

0.841 

145 

0.36 

0.35 

G-y 

4 

8 

8 

0.946 

66 

0.36 

0.36 

high speed 

6 

2 

16 

0.595 

404 

0.24 

0.20 

high accu. 

3 

16 

8 

0.999 

14 

0.30 

0.56 

low memory 

3 

1 

128 

0.901 

18 

0.03 

0.17 


TABLE III 

Performance figures for various parameter configurations. 
Memory overhead and build time are relative to dataset 

OCCUPATION AND LINEAR SEARCH TIME FOR THE TEST SET, 
RESPECTIVELY. 


when the parameter grows, and the opposite when it decreases. 
In all these cases, there is a nearly linear relationship between 
accuracy and speed-up, when one grows the other decreases. 
Memory overhead, instead, exhibits a more varied behavior. It 
remains constant when operating on C, since the data structure 
does not change. It is positively correlated with speed when 
operating on G, because faster search is obtained by defining 
more cones. On the contrary, it is negatively correlated with 
speed when operating on R, because faster search is obtained 
by using less rotations. Therefore, one can keep memory low 
both when high-speed is required (reducing R) and when high 
accuracy is required (reducing G). This is reflected in the 
next two configurations, selected to guarantee high speed, and 
high accuracy, respectively, where memory overhead is always 
relatively low. The last configuration instead has almost no 
memory overhead and still a good performance. 

C. Fast copy-move forgery detection 

With the diffusion of powerful image editing tools, image 
manipulation, often with malicious and dangerous aims, has 
become easy and widespread. Copy-move is one of the most 
common attacks, where a piece of the image is cut and paste 
somewhere else in the same image to hide some undesired 
objects. Using material drawn from the same target image, in 
fact, raises the likelihood to escape the scrutiny of a casual 
observer. 

The most effective copy-move detectors GSl, 113, Q are 
based on dense feature matching. A feature is associated with 
each block, and the most similar feature is searched for over 
the whole image. Eventually, a dense held of offsets linking 
couple of pixels is obtained which, after some suitable post¬ 
processing, may reveal the presence of near-duplicate regions. 
By using scale/rotation invariant features, copy-moves can 
be effectively detected even in the presence of geometrical 
distortions. 

Feature matching is the most computation-intensive phase of 
copy-move detection algorithms. In Q, this task is carried out 
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Fig. 13. An example of copy-move forgery detection results. Top: original 
and forged images. Bottom: color-coded masks output by the original algo¬ 
rithm (left) and its modified version with ANN initialization (right). In both 
cases, despite rotation, all copied regions are detected (green) with very high 
accuracy, and no false alarm (red) is present. 

by resorting to PatchMatch i3, a fast randomized iterative 
algorithm which outputs a smooth and accurate offset held. 
Nonetheless, this process turns out to be relatively slow, since 
several iterations of PatchMatch are needed for convergence. 
This may be a serious problem when a large number of 
images must be analyzed in search of possible manipulations, 
especially with today’s multi-million pixel images. To speed¬ 
up this phase, one can improve the initialization of PatchMatch 
(random in the original algorithm) by means of approximate 
NN search tools, as done for example in im. This may 
be effective, indeed, if the ANN algorithm is itself fast and 
reasonably accurate. We show next that ROSANNA, with a 
suitable choice of the parameters, may accomplish very well 
this task. 

We use the copy-move detection algorithm proposed in 
||7l . referring the reader to the original paper for all details. 
For each 16x 16-pixel block we compute the first 16 Zemike 
moments, which represent the associated rotation-invariant 
feature. Based on these features, we then build the offset 
held by means of PatchMatch, and Anally use a suitable post¬ 
processing to extract possible near-duplicates. 

In Tab|IV| we report the average results (CPU-time and 
accuracy) observed on a small set of test images when using 
PatchMatch with random initialization (the original algorithm) 
and with the initialization provided by ROSANNA. Images 
were drawn from the FAU databas^^used in lIMI . and have an 
average size of 6 Mpixel. All of them have been manipulated 
by copy-moving some parts, either to replicate or to hide some 
objects. An example is show in FigfTSi 

With random initialization, eight iterations of PatchMatch 
are necessary to ensure convergence to an accurate offset held. 
As a consequence, the processing time is dominated by the 
matching phase. Using ROSANNA to initialize the offset held 
has some extra costs due to the need of creating the support 
data structure and to perform the ANN search. This latter is 
reduced through a suitable choice of the parameters, G = 

http://www5.cs.fau.de/ 



















IEEE TRANSACTIONS ON IMAGE PROCESSING 


II 


Task 

Random initialization 

ANN 

initialization 

Feature extraction 

6.19 

(4.8%) 

6.19 

(11.8%) 

Initialization 

- 

27.43 

(52.2%) 

PatchMatch 

100.19 

(84.9%) 

7.66 

(14.6%) 

Post-processing 

13.31 

(10.3%) 

11.22 

(21.4%) 

TOTAL CPU-time 

129.69 

52.50 

Average accuracy (FM) 

0.977 

0.971 


TABLE IV 

Average CPU-time and accuracy for copy-move forgery 

DETECTION ON 6 MPIXEL IMAGES. 


11, i? = 1, C = 1 and by moderate subsampling (1 : 3) x (1 : 
3). Although only a subset of the offsets are initialized, due to 
subsampling, this is more than enough to ensure convergence 
to a good offset field with only two iterations of PatchMatch. 
In addition, the random search phase of PatchMatch is skipped 
altogether, reducing the cost of a single iteration. Eventually, 
the extra cost of initialization is more than compensated by the 
saving in matching, leading to an overall 60% cut in CPU-time 
for the same accuracy. Fig[T^ shows in the bottom the output 
masks provided by the two version of the copy-move detector, 
which are barely distinguishable. 

V. Conclusions and future work 

We developed ROSANNA starting from the analysis of 
Fig|^ and noting that order statistics allow one to induce 
structure in otherwise unstructured data, so as to speed up 
ANN search. Under a different point of view, ROSANNA 
is just spherical FSH where, however, multiple alternative 
partitions of the space of directions are available. Experiments 
show ROSANNA to provide a state-of-the-art performance on 
unstructured data. Such good results are conhrmed also on 
real-world SIFT and MNIST data. However, when data are 
highly structured, some suitable data-dependent preprocessing 
is needed. In this work we provided a simple solution to 
this problem, but there is certainly much room for further 
improvements. Finally, we illustrated ROSANNA’s potential to 
address real-world image processing problems by considering 
the copy-move forgery detection problem. 

Due to its simple conception and implementation, 
ROSANNA may represent a precious tools in a wide range 
of image processing problems where nearest neighbor search 
is involved. Research is under way to declinate the same 
basic concepts in the context of large database image retrieval, 
dehning reliable proxy distances based on compact codes 
derived from order-statistics. 

Appendix A 

Basic results on order statistics 

Fet 

be a vector of i.i.d. zero-mean random variables with marginal 
pdf fx{x). The order statistics ..., X(^x) ^6 the random 


{K-i) (1) (i-1) 


X x-\-dx 

(K-j) (1) U-l-i) (1) (t-1) 


y y-\-dy X x-\-dx 

Fig. 14. Reference geometry for computing order statistics of the first (top) 
and second (bottom) order. In the examples K = 12, i = 3,j = 9. 

variables obtained by sorting the values of X in descendinj^ 
order. For example 

X(^i) = max(Xi,.. .,Xk) 

and 

^(K) = min(Xi,..., Xx) 

To obtain the marginal pdf of the Uth order statistics let us 
compute the probability 

Pr(X(i) e [a;, a; -I- dx]) = {x)dx 

where dx —> 0. For this event to happen, i — \ components of 
vector X must be larger than x + dx, K — i of them must be 
smaller than x, and exactly one of them must fall in the interval 
[x, X + dx] (see Fig[^ top). Since the components are i.i.d., 
by taking into account the different combinations through 
a multinomial coefficient, and neglecting 0{dx^) terms, it 
results 

K' 

1)1 « 5 ) 

X /x(i) 

All joint statistics can be computed in the same way. In 
particular, for the second-order joint pdf, taking i < j, it 
results (see FigfT4] bottom) 

K\ 

y) = - j)\{j -1 - i)\{i -1)\ 

X [Fx{y)f-^[Fx{x)-Fx{y)V-^-^ 

X [I - Fx{x)Y~^ fx{x)fx{y)u{x-y) 

Appendix B 
Theoretical bounds 

We want to characterize ROSANNA in terms of its collision 
probability for unstructured data. 

Fet 

X = {Xi,...,Xx} 

be a vector of i.i.d. zero-mean symmetric random variables 
with marginal pdf fx{x). Then, let 

Y = {Y^,...,Yk} 

^^More often, ascending order is used, but the two choices are equivalent 
and we prefer to remain coherent with ROSANNA’s functioning. 
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Fig. 15. Geometry of the problem in 2d. 




Fig. 16. Theoretical bounds and MonteCai'lo estimate for the crossing 
probability in the case of i.i.d. standard Gaussian RV’s, with K=2, G=\. 


be a second vector which, given X = x, is uniformly 
distributed on the hypersphere of radius r centered on x, that 

1 

/Y(y|x) = - x|| - r) 

where (r) is the measure of the K-d hypersphere of radius 
r, and 6{-) is the Dirac delta function. We want to compute, 
for any radius r, the probability that X and Y belong to the 
same cone 

Pconir) = Pr[/i(X) = h(Y)] 

where the hash function h{-) associates a vector to a given 
cone based on the index and sign of its G largest components. 
Rather than the probability of collision, we will consider its 
complement 

Pcross(r') = 1 Pcoll(r') 

which is the probability that point Y lies across any of the 
boundaries of X’s cone. 

A. The 2d case 

Let us begin by considering the simplest non-trivial case of 
K = 2 and G = 1. FigfTS] provides a pictorial description of 
our problem in this setting. In the example, point x belongs 
to cone Cl where the hrst component is the largest, and it is 
positive. In addition, we focus on the half-cone where also 
the second component of x is positive 

n = {x S : xi > a ;2 > 0} 

However, thanks to symmetry, all following arguments apply 
equally well to all other cones and half-cones with obvious 
modihcations. Therefore 

Pcross(r*) = Pcross(r'I H) = / /x|o(x)pcross(r | x)(ix (8) 

Jn 

Cone Cl has two boundaries, the lines with equations 1/2 = yi 
and y 2 = —yi- For each x in ci let us label the boundaries in 
order of increasing distance from x, so that 

Pcross.l(r I x) 

is the probability of crossing boundary number 1, the nearest 
one. Let A be the distance of x from this boundary. Since we 
restrict attention to this is 

A = {xi- X2 )Is/2 


while more in general it holds 

^ _ max(|a;i|, |a; 2 |) - min(|a:i|, |a: 2 |) 

72 

If A < r, part of the circumference of radius r centered on x 
will cross the nearest boundary. For our hypothesis of uniform 
distribution of Y|x, the fraction of the circumference past the 
boundary represents the probability of crossing it 


Pcross,l(T’ I x) 


0(r, A)I'k a < r 
0 otherwise 


(9) 


where 


6*(r, A) = arccos 



( 10 ) 


Of course, other points on the circumference may cross the 
second boundary, and some may cross both. Therefore 0 
is only a lower bound to Pcross(?’|x). A good upper bound 
is obtained by the sum Pcross,i(r|x) + _Pcross, 2 (r|x), which 
may be difficult to compute. Simpler but looser bounds are 
2_Pcross,i(r|x) and 1 — 1/Nc, with Nc the total number of 
cones. In summary 


Pcross,l(»'|x)< Pcross7|x) (11) 

^ ■ fo / \ \ 

< min I 2pcross.i(r|x), 

and by averaging over X G H we obtain lower and upper 
bounds for Pcross(r). 

In Fig[T^ we plot these upper and lower bounds, together 
with the actual crossing probability estimated through Monte- 
Carlo simulation, when the X^’s are standard Gaussian RV’s. 
As expected, for small values of r the MonteCarlo estimate 
is very close to the theoretical lower bound (see also the log- 
scale plot on the right), while the gap grows when the distance 
between the two points, X and Y, becomes comparable with 
their own norm. When r —> oo, of course, Y belongs to any 
cone with the same probability, and the curve approaches the 
theoretical upper bound of 3/4. 


B. The general case 

We now consider the general high-dimensional case, pro¬ 
ceeding in the very same way as for the 2d case, except for 
some suitable modihcations. The most important difference 
with respect to the previous case is that we will resort to 
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order statistics to reduce the final i^-dimensional integral to a 
numerically tractable 2d integral. 

In this case we have Nc = cones, statistically 

indistinguishable from one another. Let us consider the cone 
Cl where the first G components are also the largest, and they 
are all positive. Furthermore, let us restrict attention to the 
subregion of ci, call it again 11, where also the smallest K — G 
components are all positive 

n = {x G :min(xi,..., xq) > max(a:G+i, ■ ■ •, xk), 
x^>0, i = (12) 


Again, because of symmetry, this restriction is immaterial, and 
(j^ still holds. So, we will provide lower and upper bounds 
for pcross(f’|x), and then integrate over 11. 

Cone Cl is now delimited by 2G{K — G) hyperplanes, the 
hyper-planes with equations yi — yj and yi = —yj, for all 
i = and j — G + 1,... ,K, and the point Y at 

distance r from x leaves the cone only if it crosses at least 
one of such boundaries. 

Let again A be the distance of x from the closest boundary. 
If A > r the probability of crossing that boundary (or any 
other) is 0. Otherwise, it can be computed as the ratio between 
the measure of the hyper-spherical cap intercepted by 

a hyperplane at distance A from the center, and the measure 
S^{r) of the whole hyper-sphere. It is known that 


S^{r) 


r(f) 


with r( ) the Gamma function, while the measure of the cap 
can be computed by integrating over 9 the (iL—l)-dimensional 
hyper-spheres of radius r sin(0) 


= f 

Jo 


e(r,A) 


S^-\rsin{9))rd9 


where 9{r,A) is still given by (101. As for A, note that the 
closest boundary to x is the hyperplane of equation y^ = yM 
where m is the index of the smallest component among the G 
largest, and M is the index of the largest component among 
the K — G smallest. Consequently 


A = 


V2 


In summary it results 

Pcross,l(^ I x) = 


S^’'^ir)/S^{r) 

0 


A < r 
otherwise 




Fig. 17. Theoretical bounds and MonteCarlo estimate for the crossing 
probability in the case of i.i.d. standard Gaussian RV’s, with K=i6, G=4. 


dimensional integral, since Pcross.i(’' I x) depends only on the 
G-th and (G+ l)-st largest components of x through A, that 
is. 


/x|n(x)Pcross,l(?’ I 


x)dx = 


/X(G)A(G+l)|f2 («:/?)? cross,1 {r\a,^)d^da 


JO JO 

where the ’s are the order statistics obtained by sorting X 
in descending order (remember that in O this coincides with 
sorting the vector for descending absolute values). Therefore 
we only need the joint pdf of X(^c) A(( 3 _|_i), which is 

given by JtIi. 

In Figl^ we plot these upper and lower bounds, together 
with the actual crossing probability estimated through Monte- 
Carlo simulation, when the X^’s are standard Gaussian RV’s, 
for a single high-dimensional case, with K=16 and G=4. 
Again, for small values of r the MonteCarlo estimate is very 
close to the theoretical lower bound. The upper bound, instead, 
is too loose to be of practical guide. 

These theoretical results confirm the correctness of the 
proposed algorithm. Close points tend to be hashed in the 
same cell, and the collision probability approaches 1 as the 
distance goes to 0. In addition, they can be used to guide 
the choice of the algorithm parameters. One could compute 
upper and lower bounds for each value of K and G, and 
choose the combination of parameters that better meets the 
problem requirements. It is worth reminding that these results 
hold rigorously only for the Gaussian i.i.d. case, and have been 
obtained with reference to a simple version of the algorithm, 
with a single basis and no multi-probe. Nonetheless, they 
represent a conceptual support to practical design. 


Again, Pcross,i(f |x) is a lower bound forpcross(f |x). An upper 
bound can be easily obtained as 

2G{K-G) 

PcYoss{x I x) ^ ^ ^ Pcross,i(^|x) 

i=l 

< min ^2G(A: - G)pcross.i(»'|x), 

As the last step of our development, we must compute the 
integral ([^ over X G fl which, for AT ^ 1, is computationally 
intractable. However, we are not really interested in the K- 
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